Machine Learning Bolsters Evidence That D1, Nef, and Tat Influence HIV Reservoir Dynamics

Background: The primary hurdle to curing HIV is the reservoir that is established early in infection. In an effort to find new treatment strategies, we and others have focused on understanding the selection pressures exerted on the reservoir by studying how proviral sequences change over time.
Methods: To gain insight into the dynamics of the HIV reservoir, we analyzed longitudinal near full-length sequences from 7 people living with HIV, sampled between 1 and 20 years after the initiation of antiretroviral treatment. We fit Bayesian mixed-effects models to these near full-length sequencing data to characterize the decay of the reservoir, using single-phase and multiphasic decay models. In addition, we developed a machine-learning approach that uses logistic regression to identify the elements within the HIV genome most associated with proviral decay and persistence. By systematically analyzing proviruses in which a specific element is deleted, we gain insight into the role of that element in reservoir contraction and expansion.
Results: Our analyses indicate that biphasic decay models fit the dynamics of the intact reservoir better than single-phase models. Based on the biphasic decay pattern of the intact reservoir, we estimated the half-lives of the first and second phases of decay to be 18.2 (17.3 to 19.2, 95% CI) and 433 (227 to 6400, 95% CI) months, respectively. In contrast, the dynamics of defective proviruses did not definitively favor either model, with an estimated half-life of 87.3 (78.1 to 98.8, 95% CI) months during the first phase of the biphasic model. Machine-learning analysis of HIV genomes at the nucleotide level revealed that the presence of the splice donor site D1 was the principal genomic element associated with contraction. This role of D1 was then validated in an in vitro system. Using the same approach, we additionally found supporting evidence that HIV nef may confer a protective advantage for latently infected T cells, while tat was associated with clonal expansion.
Conclusions: The nature of intact reservoir decay suggests that the long-lived HIV reservoir contains at least 2 distinct compartments, the first of which decays faster than the second. Our machine-learning analysis of HIV proviral sequences reveals that specific genomic elements are associated with contraction while others are associated with persistence and expansion. Together, these opposing forces shape the reservoir over time.
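
As a rough illustration of what the reported half-lives imply, the sketch below converts a half-life to a first-order decay rate (rate = ln 2 / half-life) and evaluates a two-compartment decay curve. The sum-of-two-exponentials form, the function names, and the 90%/10% split between compartments are assumptions made for illustration only; the abstract does not state how the Bayesian mixed-effects models were parameterized.

```python
import math

def decay_rate(half_life_months):
    """Convert a half-life (months) to a first-order decay rate (per month)."""
    return math.log(2) / half_life_months

def biphasic_fraction(t_months, half_life_1, half_life_2, frac_fast):
    """Fraction of the baseline reservoir remaining at time t (assumed model form)."""
    k1 = decay_rate(half_life_1)
    k2 = decay_rate(half_life_2)
    fast = frac_fast * math.exp(-k1 * t_months)
    slow = (1 - frac_fast) * math.exp(-k2 * t_months)
    return fast + slow

# Point estimates reported for the intact reservoir (months).
HALF_LIFE_FAST = 18.2
HALF_LIFE_SLOW = 433.0
FRAC_FAST = 0.9  # hypothetical split between compartments, not a study estimate

for years in (0, 1, 5, 10, 20):
    remaining = biphasic_fraction(12 * years, HALF_LIFE_FAST, HALF_LIFE_SLOW, FRAC_FAST)
    print(f"{years:>2} y: {remaining:.4f} of baseline")
```

Under these assumptions the curve falls quickly while the fast compartment dominates and then flattens as the slow compartment takes over, which is the qualitative shape a biphasic model describes.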


Cannon L, Fehrman S, Pinzone M, Weissman S, O'Doherty U. Machine Learning Bolsters Evidence That D1, Nef, and Tat Influence HIV Reservoir Dynamics. Pathogens and Immunity. 2024;8(2):37-58. doi: 10.20411/pai.v8i2.621

Mock Experiment to Explain Force Factor Calculation.
(A) To illustrate how force factors are calculated, we created a genome that consists of four elements (red, blue, black, and green). In this mock experiment, we sequenced five genomes at each of three time points (T1, T2, and T3). Similar to the intact and defective nature of actual HIV genomes in vivo, our genomes can be intact, containing all four genomic elements, or they can be defective and contain only a subset of the four elements.
(B) For this mock analysis, we selected all two-element combinations of the four elements, for a total of 6 possible combinations (C1, C2, C3, C4, C5, C6). For each combination, we determined the proportion of genomes at each time point that contain the elements in the combination and then performed logistic regression to determine the rate parameter (slope). For ease of understanding, element proportions and linear slopes are shown in the figure. Elements that are associated with decay will have a negative slope, and elements that are protective will have a positive slope. For each combination, we calculate the corresponding slope (β1, β2, β3, β4, β5, β6).
(C) We prepare a table of every combination with its corresponding slope and rank the combinations from highest to lowest slope. We then focus on the combinations that have the biggest effect by studying the extremes. For this example, we consider the combinations in the lower ~30% and upper ~30% in terms of their slopes (i.e., the two combinations with the highest slopes and the two combinations with the lowest slopes).
(D) To calculate the force factor for a given element, we count the number of occurrences of that element in the upper group, subtract the number of occurrences of that element in the lower group, and then divide by the number of combinations in one group. For example, the force factors for the blue and green elements are: FF(blue) = (0 - 2)/2 = -1 and FF(green) = (2 - 0)/2 = 1. The force factor ranges from -1 to 1. The closer a force factor is to -1, the more the element is related to decay. Conversely, the closer a force factor is to 1, the more the element is related to persistence. In this simple example, the blue element has a force factor of -1 and is therefore heavily associated with decay, and the green element has a force factor of 1 and is therefore associated with persistence.

All elements that were required to define a provirus as intact are indicated in bold with asterisks. The trans-activation response element was not included because our cloning strategy did not capture its entire sequence. We accepted both the canonical D1 sequence (GGTRAGT) and a GT dinucleotide cryptic donor site located four nucleotides downstream from D1.
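
The force factor calculation in panel (D) of the mock experiment can be sketched as follows. The slope values are invented for illustration (they are not taken from the figure or the study), and the names `SLOPES` and `force_factors` are hypothetical. The sign is taken as the upper-group count minus the lower-group count so that a decay-associated element scores -1, matching the interpretation given in the caption.

```python
from itertools import combinations

ELEMENTS = ("red", "blue", "black", "green")

# Hypothetical logistic-regression slopes, one per two-element combination.
# These numbers are made up for illustration only.
SLOPES = {
    frozenset(("red", "blue")): -0.9,
    frozenset(("blue", "black")): -0.7,
    frozenset(("blue", "green")): 0.0,
    frozenset(("red", "black")): 0.1,
    frozenset(("black", "green")): 0.6,
    frozenset(("red", "green")): 0.8,
}

def force_factors(slopes, elements, group_size):
    """Force factor per element: (upper count - lower count) / group size."""
    ranked = sorted(slopes, key=slopes.get, reverse=True)  # highest slope first
    upper = ranked[:group_size]   # combinations with the highest slopes
    lower = ranked[-group_size:]  # combinations with the lowest slopes
    result = {}
    for element in elements:
        n_upper = sum(element in combo for combo in upper)
        n_lower = sum(element in combo for combo in lower)
        result[element] = (n_upper - n_lower) / group_size
    return result

# Sanity check: every two-element combination has a slope.
assert set(SLOPES) == {frozenset(pair) for pair in combinations(ELEMENTS, 2)}

ff = force_factors(SLOPES, ELEMENTS, group_size=2)
for element, value in ff.items():
    print(f"{element:>5}: {value:+.1f}")
```

With these made-up slopes, blue appears in both low-slope combinations and green in both high-slope combinations, so they land at the two ends of the scale, while red and black cancel out to 0.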

Criteria for Excluding Sequences:
To avoid ambiguous nucleotides due to low coverage at each end, we analyzed the region of each sequence from 20 nucleotides downstream of the 5' end primer to 20 nucleotides upstream of the 3' end primer.
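
A minimal sketch of this trimming step, assuming the primer sequences can be located by exact string matching on the consensus sequence. The function name, the exact-match lookup, and the toy primers are illustrative assumptions, not the authors' pipeline.

```python
def trim_between_primers(sequence, primer_5, primer_3_site, margin=20):
    """Keep the region from `margin` nt after the 5' primer to `margin` nt before the 3' primer site.

    `primer_3_site` is the 3' primer binding site written on the same strand as `sequence`.
    Returns None if either primer is not found or nothing would remain after trimming.
    """
    start_5 = sequence.find(primer_5)
    start_3 = sequence.rfind(primer_3_site)
    if start_5 == -1 or start_3 == -1:
        return None
    left = start_5 + len(primer_5) + margin
    right = start_3 - margin
    if left >= right:
        return None
    return sequence[left:right]

# Toy example with made-up primers and a 2-nt margin so the result is easy to read.
toy = "AAAA" + "CCGG" + "TT" + "ACGTACGT" + "TT" + "GGCC" + "AAAA"
print(trim_between_primers(toy, "CCGG", "GGCC", margin=2))  # ACGTACGT
```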
On rare occasions, proviruses were excluded from analysis due to technical limitations.
Specifically, proviruses were excluded based on the following:
1) Poor read coverage leading to failure to assemble a consensus sequence.
2) Reads that were determined to originate from more than one provirus, based on the following criteria:
• Dinucleotide calls (>5%) within the aligned reads, suggesting that more than one provirus was present during PCR amplification. Exceptions to this rule included insertions of additional adenosine nucleotides at the beginning or end of chains of at least 5 consecutive adenosine nucleotides, as well as other dinucleotide calls appearing at a frequency consistent with PCR error during any round of amplification.
• Regions with sharp drops in coverage, suggesting the presence of both a provirus with a deletion and at least one provirus without a deletion.
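
The two mixed-template checks above could be sketched as below. This assumes that a "dinucleotide call" means a position where a second base is seen in more than 5% of reads, and it uses a hypothetical cutoff for what counts as a "sharp" coverage drop, since the text does not define one. The homopolymer-adenosine and PCR-error exceptions are not modelled.

```python
from collections import Counter

def mixed_positions(columns, min_minor_fraction=0.05):
    """Return indices of alignment columns whose second most common base exceeds the threshold."""
    flagged = []
    for index, column in enumerate(columns):
        counts = Counter(column).most_common(2)
        if len(counts) == 2 and counts[1][1] / len(column) > min_minor_fraction:
            flagged.append(index)
    return flagged

def coverage_drops(coverage, max_ratio=0.5):
    """Return indices where coverage falls below `max_ratio` times the previous position.

    The 0.5 ratio is a hypothetical cutoff chosen for this sketch.
    """
    return [
        i for i in range(1, len(coverage))
        if coverage[i - 1] > 0 and coverage[i] / coverage[i - 1] < max_ratio
    ]

# Toy pileup: each string holds the bases seen at one position across 40 reads.
columns = ["A" * 40, "A" * 39 + "G", "C" * 28 + "T" * 12]
print(mixed_positions(columns))               # [2]
print(coverage_drops([100, 98, 40, 41, 39]))  # [2]
```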

Motifs and ORFs Identification
Sequence reads from each provirus were de novo assembled to generate a consensus sequence of each proviral genome. All possible ORFs were annotated within the assembled genomes by searching for the canonical start codon sequence ATG and extending the ORF until a stop codon was reached. The non-canonical start codon TTT was used to identify the pol gene. To be labeled as an intact HIV ORF, we required that the start codon (ATG, or TTTTTT for pol) and the stop codon be present within 20 nucleotides of their positions in the corresponding HXB2 ORF, with no premature stop codons. To identify Tat and Rev, exons 1 and 2 of Tat and Rev were annotated on the proviral genome based on 65% homology with exons 1 and 2 of HXB2 Tat and Rev. These homologous Tat exon 1/2 and Rev exon 1/2 sequences were then extracted from the provirus, concatenated, and translated. The Tat and Rev sequences were considered intact if they had no early stop codons and retained the proper stop codon. We accepted known early-stop variants of Tat.
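
The two kinds of annotation named in this section's heading, motif search and ORF search, can be sketched as below. The motif search expands IUPAC codes (for example, R = A or G in the D1 sequence GGTRAGT mentioned earlier) into a regular expression. The ORF scan follows the ATG-to-stop description above. This is a simplified illustration with hypothetical function names: it scans the forward strand only and does not implement the HXB2 position check, the pol start rule, or Tat/Rev exon joining.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_motif(sequence, motif):
    """Return 0-based start positions of an IUPAC motif, including overlapping hits."""
    pattern = "(?=" + "".join(IUPAC[base] for base in motif) + ")"
    return [match.start() for match in re.finditer(pattern, sequence)]

def find_orfs(sequence, start_codon="ATG"):
    """Return (start, end) pairs for forward-strand ORFs; `end` is just past the stop codon."""
    orfs = []
    for match in re.finditer("(?=" + start_codon + ")", sequence):
        start = match.start()
        # Walk codon by codon from the start codon until the first in-frame stop.
        for pos in range(start + 3, len(sequence) - 2, 3):
            if sequence[pos:pos + 3] in STOP_CODONS:
                orfs.append((start, pos + 3))
                break
    return orfs

toy = "CCGGTGAGTCCATGAAATAGGG"  # made-up sequence for illustration
print(find_motif(toy, "GGTRAGT"))  # [2]
print(find_orfs(toy))              # [(11, 20)]
```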
Table showing the p values and mean force factors for each element when all data are used, when clones are removed, and when only the large clones are considered.
*Significant Elements*
Supplemental Figure 2: Force Factor Null Distributions
The p values were calculated by randomly permuting the elements in each provirus at each time point and calculating the resulting force factors. The null distribution is shown in each plot, with the significant elements marked at their respective force factors, for A) all proviruses, B) proviruses with clones removed, and C) clonal proviruses only.
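
A permutation test of this kind can be sketched as follows. The shuffle shown here (reassigning which elements are present within one provirus while keeping the number present) is one possible reading of "permuting the elements in each provirus". The two-sided comparison and the +1 correction are likewise assumed conventions, not details given in the caption, and the null values are invented.

```python
import random

def empirical_p_value(observed, null_values):
    """Two-sided empirical p value with the +1 correction (an assumed convention)."""
    extreme = sum(abs(value) >= abs(observed) for value in null_values)
    return (extreme + 1) / (len(null_values) + 1)

def permute_elements(provirus, rng):
    """Shuffle which elements are present in one provirus, keeping the number present."""
    labels = list(provirus)
    flags = [provirus[label] for label in labels]
    rng.shuffle(flags)
    return dict(zip(labels, flags))

rng = random.Random(0)  # fixed seed so the shuffle is reproducible
provirus = {"red": True, "blue": False, "black": True, "green": True}
shuffled = permute_elements(provirus, rng)
print(shuffled)

# Hypothetical null force factors; in practice each would come from rerunning
# the full force factor analysis on one permuted data set.
null = [-0.5, -0.25, 0.0, 0.0, 0.25, 0.5, 0.75, -1.0, 0.25]
print(empirical_p_value(-1.0, null))  # 0.2
```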

Supplemental Table 1: List of the Sequences of Splice and Packaging Sites Used to Annotate the Sequenced Proviruses
pol: Location of ORF as defined by HXB2. Start and stop codon +/-20 nucleotides
vif: Location of ORF as defined by HXB2. Start and stop codon +/-20 nucleotides
vpr: Location of ORF as defined by HXB2. Start and stop codon +/-20 nucleotides
vpu: Location of ORF as defined by HXB2. Start and stop codon +/-20 nucleotides
env: Location of ORF as defined by HXB2. Start and stop codon +/-20 nucleotides
nef: Location of ORF as defined by HXB2. Start and stop codon +/-20 nucleotides
rev: Location of ORF as defined by HXB2. Start and stop codon +/-

Table 4: Decay Parameter Estimates (column headings: Single Phase; Intact; Defective)
Table showing the intact and defective sequences analyzed at each time point for each chronically treated (CT) study participant.
Table showing parameter estimates for the reservoir dynamics analysis. Decay rates are given for fits of the single-phase model and the biphasic model, both when all data are considered and when clones are reduced.